
Record: Full GPTQ + LeakyReLU² + Parallel Muon + BigramHash 3072 (val_bpb 1.1163, 3-seed mean) #593

Closed
abaybektursun wants to merge 5 commits into openai:main from abaybektursun:submission/full-gptq-leakyrelu-1.1170

Conversation

Contributor

@abaybektursun abaybektursun commented Mar 24, 2026

Summary

  • val_bpb: 1.1163 (3-seed mean, std 0.0012) | ~15.90 MB | 8×H100 SXM, 600s | No TTT

3-Seed Results

| Seed | step_avg | steps | Pre-quant bpb | Post-GPTQ sliding bpb | Artifact (bytes) |
|------|----------|-------|---------------|-----------------------|------------------|
| 42   | 83.4ms   | 7,192 | 1.1349        | 1.1149                | 15,895,636       |
| 1337 | 83.4ms   | 7,195 | 1.1370        | 1.1172                | 15,899,284       |
| 2024 | 83.5ms   | 7,190 | 1.1367        | 1.1167                | 15,904,036       |
| Mean | 83.4ms   | 7,192 | 1.1362        | 1.1163 (std 0.0012)   |                  |

Key Techniques

Full Hessian GPTQ. Second-order, Hessian-aware quantization with Cholesky error compensation and column reordering (actorder). Relative to the pre-quantization numbers, the post-GPTQ sliding evaluation is 0.0199 bpb better (mean 1.1362 → 1.1163).
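As a sketch of the mechanism (illustrative only, not this PR's kernel; the function name and the toy per-column scale are assumptions): each column is rounded in actorder, and its rounding error is pushed onto the not-yet-quantized columns through the Cholesky factor of the inverse Hessian.

```python
import numpy as np

def gptq_quantize(W, H, bits=6):
    # Minimal Hessian-aware GPTQ sketch: quantize columns of W in order of
    # decreasing Hessian diagonal (actorder), compensating each column's
    # rounding error via the upper Cholesky factor of the inverse Hessian.
    W = W.astype(np.float64).copy()
    n = W.shape[1]
    order = np.argsort(-np.diag(H))                      # actorder permutation
    W, Hp = W[:, order], H[np.ix_(order, order)]
    Hinv = np.linalg.cholesky(np.linalg.inv(Hp + 1e-6 * np.eye(n))).T
    qmax = 2 ** (bits - 1) - 1                           # int6 -> [-32, 31]
    Q = np.zeros_like(W)
    for i in range(n):
        col = W[:, i]
        scale = np.abs(col).max() / qmax + 1e-12         # toy per-column scale
        Q[:, i] = np.clip(np.round(col / scale), -qmax - 1, qmax) * scale
        err = (col - Q[:, i]) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])   # error compensation
    return Q[:, np.argsort(order)]                       # undo actorder
```

The Hessian here would come from calibration activations (H ∝ XᵀX); the real implementation additionally packs the int6 grid with lzma.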

Parallel Muon optimizer. Parameter Banking (batched Newton-Schulz, 15× faster optimizer) + async reduce-scatter/all-gather communication overlap. 83ms/step vs ~89ms baseline.
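The batched Newton-Schulz step can be sketched as follows (a toy numpy version under assumptions: quintic coefficients from the public Muon implementation, and "banking" rendered as one batched matmul over a stack of matrices rather than a per-parameter loop):

```python
import numpy as np

def batched_newton_schulz(G, steps=5):
    # Orthogonalize a whole bank of gradient matrices at once: G has shape
    # (bank, m, n). Each iteration is a batched matmul, which is what makes
    # banking faster than looping over parameters one at a time.
    a, b, c = 3.4445, -4.7750, 2.0315          # Muon's quintic coefficients
    X = G / (np.linalg.norm(G, axis=(-2, -1), keepdims=True) + 1e-7)
    tall = X.shape[-2] > X.shape[-1]
    if tall:                                    # iterate on the short side
        X = np.swapaxes(X, -1, -2)
    for _ in range(steps):
        A = X @ np.swapaxes(X, -1, -2)
        X = a * X + (b * A + c * A @ A) @ X     # drives singular values to ~1
    return np.swapaxes(X, -1, -2) if tall else X
```

The async reduce-scatter/all-gather overlap mentioned above is a separate, distributed-communication concern not shown here.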

BigramHash 3072×80. Budget-optimal allocation of the 16MB artifact limit — more hash buckets with narrower embeddings. Coverage beats fidelity: each bigram embedding passes through a learned 80→512 projection that reconstructs useful features from narrower input, while doubling buckets from 1536→3072 halves hash collisions, capturing more unique bigram patterns. Narrower embeddings also compress better under GPTQ+lzma (random-looking vectors have high entropy), freeing bytes for the larger table.
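A minimal sketch of the bucket-lookup-then-project path (the hash mixing constant, initialization, and function names are illustrative assumptions; only the 3072×80 table and 80→512 projection shapes come from the PR):

```python
import numpy as np

N_BUCKETS, EMB_DIM, MODEL_DIM = 3072, 80, 512   # shapes from the PR's config

rng = np.random.default_rng(0)
table = rng.standard_normal((N_BUCKETS, EMB_DIM)).astype(np.float32) * 0.02
proj = rng.standard_normal((EMB_DIM, MODEL_DIM)).astype(np.float32) * 0.02

def bigram_features(tokens):
    # Hash each (prev, cur) token pair into one of N_BUCKETS slots, look up
    # its 80-d embedding, and project it into the 512-d residual stream.
    # Doubling N_BUCKETS halves the expected collision rate at fixed load.
    t = np.asarray(tokens, dtype=np.uint64)
    h = (t[:-1] * np.uint64(0x9E3779B1) ^ t[1:]) % np.uint64(N_BUCKETS)
    return table[h.astype(np.int64)] @ proj     # shape (len(tokens)-1, 512)
```

In training, `table` and `proj` would be learned; the narrow 80-d rows are what compress well under GPTQ+lzma.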

Architecture

| Component    | Setting                           |
|--------------|-----------------------------------|
| Layers       | 11 (512d, 8H, 4KV)                |
| MLP          | 3× with LeakyReLU(0.5)²           |
| BigramHash   | 3072 buckets, dim=80              |
| XSA          | Last 4 layers                     |
| RoPE         | Partial (16/64 dims)              |
| LN Scale     | 1/√(layer+1)                      |
| VE128        | Layers 9-10                       |
| Weight avg   | EMA(0.997) + Tight SWA (every 50) |
| Quantization | Full Hessian GPTQ int6 + lzma(9)  |
| Optimizer    | Parameter Banking + Parallel Muon |
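The LeakyReLU(0.5)² entry sits in the squared-ReLU family common in speedrun MLPs. One plausible reading, sketched under loud assumptions (the sign-preserving square is a guess; the PR never spells the form out):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # Assumed form of "LeakyReLU(0.5)^2": LeakyReLU with negative slope 0.5,
    # then a square that keeps the sign so the activation stays monotone.
    # Hypothetical reconstruction, not confirmed by the PR text.
    y = np.where(x >= 0.0, x, slope * x)
    return np.sign(y) * y * y
```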

Credits

🤖 Generated with Claude Code

…ed mean)

Full Hessian GPTQ (Cholesky error compensation, actorder) replaces
GPTQ-lite, improving post-quant BPB by 0.0048. LeakyReLU(0.5)² activation.
No TTT needed.

3-seed results:
  Seed 2025: 1.1167 bpb, 15.90 MB
  Seed 1337: 1.1171 bpb, 15.96 MB
  Seed 2024: 1.1173 bpb, 15.99 MB
  Mean:      1.1170 (std 0.0003)

All artifacts under 16MB. Eval ~185s (well within 10 min).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun changed the title Record: Full GPTQ + LeakyReLU² + Parallel Muon — val_bpb 1.1170 (3-seed mean, no TTT) Record: Full GPTQ + LeakyReLU² + Parallel Muon — val_bpb 1.1171 (3-seed mean, no TTT) Mar 24, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 24, 2026
Novel: XSA-all(11) + selective ±1 pruning on openai#593 base.
Parallel Muon gives 83ms/step = competition-grade throughput.
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 24, 2026
XSA-all(11) on openai#593 Parallel Muon stack. 83ms/step, 6923 steps.
15.94MB fits. Novel contribution: XSA-all + selective pruning.
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 24, 2026
Beats merged openai#414 by 0.0073 nats. Meets record threshold.
Stack: openai#593 + XSA-all(11) + selective ±1 pruning.
Ready for submission.
BigramHash 1536×128 → 3072×80: coverage-over-fidelity budget reallocation.
More hash buckets capture more bigram patterns; narrower embeddings compress
better under GPTQ+lzma, freeing bytes for the larger table.

GPTQ memory fix: free training model before Hessian calibration to prevent
OOM with the larger BigramHash optimizer state.

3-seed results: 1.1149, 1.1172, 1.1167 (mean 1.1163, std 0.0012)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun changed the title Record: Full GPTQ + LeakyReLU² + Parallel Muon — val_bpb 1.1171 (3-seed mean, no TTT) Record: Full GPTQ + LeakyReLU² + Parallel Muon + BigramHash 3072 (val_bpb 1.1163, 3-seed mean) Mar 24, 2026
@valerio-oai
Contributor

As per your table, you are counting the GPTQ calibration as an eval-time intervention. However, your implementation reuses training data for it, meaning it accesses training data at eval time, which is forbidden. Closing for now: if you want this calibration to count as training, it should be counted as part of the training 600s-budget, not the eval budget.

vimeto added a commit to vimeto/parameter-golf that referenced this pull request Mar 24, 2026
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 25, 2026
…xH100

30+ experiments on the PR openai#593 stack (1.1171 BPB), all negative or marginal:
- CUTLASS SM90 GEMM: 2.5x slower than cuBLAS
- Fused Triton GEMM+activation: autograd.Function kills backward
- FP8, QKV fusion, custom CUDA: all slower or no improvement
- SpinQuant, mixed int5/int8, Soft-Round QAT: noise-level
- XSA-all, VRL, Gated Attention, bigger model, shard ordering: all worse
- 22 legal TTT experiments: all worse than non-TTT baseline

Key finding: 82ms step is 95%+ optimized. torch.compile handles all fusion.
Competition at d=512 is bits-per-parameter, not FLOPS-per-second.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>